SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models

Accurate inference of orthologous genes constitutes a prerequisite for comparative and evolutionary genomics. SonicParanoid is one of the fastest tools for orthology inference; however, its scalability and accuracy have been hampered by time-consuming all-versus-all alignments and the existence of proteins with complex domain architectures. Here, we present a substantial update of SonicParanoid, where a gradient boosting predictor halves the execution time and a language model doubles the recall. Application to empirical large-scale and standardized benchmark datasets shows that SonicParanoid2 is much faster than comparable methods and also the most accurate. SonicParanoid2 is available at https://gitlab.com/salvo981/sonicparanoid2 and https://zenodo.org/doi/10.5281/zenodo.11371108. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-024-03298-4.


Supplementary Text:
Outline of SonicParanoid's graph-based algorithm.SonicParanoid [31], described in Cosentino and Iwasaki (2019), predicts the ortholog genes between two species using a substantially modified algorithm from the approach used in the original InParanoid, where seed orthologs are identified using the bidirectional best hit (BBH).Next, the sequence similarities obtained for the seed orthologs are compared to those obtained for the intra-species alignments to identify candidate in-paralogs.One difference with the original InParanoid is in the scoring function used to identify the inparalogs.SonicParanoid uses a function that exponentially penalize candidate orthologs which length is considerably different from the orthologs obtained using the BBH.
The candidate orthologs are then clustered using a greedy algorithm into clusters orthologs shared amongst pairs of species.SonicParanoid also integrates the ability to predict orthogroups (orthologs shared amongst multiple species), and in recent versions it uses MCL instead of the original single linkage clustering.Because SonicParanoid was published as an application note, which limits the length of the manuscript to only one page, the details cannot be found in the main text of the paper.An extensive description of the SonicParanoid algorithm, as well as additional figures, can be found in the supplementary data of Cosentino and Iwasaki [31] (https://oup.silverchaircdn.com/oup/backfile/Content_public/Journal/bioinformatics/35/1/10.1093_bioinformatics_bty631/3/bioinformatics_35_1_149_s1.pdf?Expires=1714522188&Signature=FwuT~W0zcKAezVxV3hK6Skim1HBoVjRzZ4R7Ew3PxyS 6WS-nQCy8O89-xGDPjKJwXr7LxjxMsqR99F4mT91e-fTitMP8myRNq57DnzJpXgcOsSJz298nBxLIYuqF1fFUabZuijSG2a-kWOafLK92X2SK1645zFT~R7v7nbpYeoQXdIvCFcR0j8M1bPUdfFyEYPG3c9vesk61GJCvp IFti5z5iMHje6mtzcN7z6DJzsrNfISX2hgTIj9cTmYVO-5CKhgqS0zVaFpVArsRwMGZ1Hq0DLOlAgrwrJ5FDjeuq6At58sKl8b3Fovg1ioPzXkU~0WVN qk0H4biv8UGCLNcaQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA).

CPU-time, wall-time, and memory estimation
CPU-time and wall-time for each run shown in this study were computed using hyperfine (https://github.com/sharkdp/hyperfine).We set the "--runs" parameter to 1 to instruct hyperfine to perform a single a run.The JSON files generated by hypefine are stored in https://gitlab.com/salvo981/sonic-manuscript/genome_biology_review_files/round1/logs.
For each run the CPU-time is computed as the sum of user and system time.Following is the template command used for hypefine: hyperfine --style none --runs 1 -export-json hyperfine.results.json"<command to run>".<command to run> can be any of the runs using any of the tested tools (e.g.SonicParanoid).Memory estimation was performed using memory_profiler.py(https://pypi.org/project/memory-profiler/).In order to let the tool take into account the memory shared by the forked processes (or child processes), we use the psutil_pss backend instead of the deafult (psutil).Using psutil would simply perform a cumulative sum of the memory that forked job access, even when it is shared.This might lead to not realistic memory peak estimations.mprof run --nopython --backend psutil_pss --multiprocess -C -output memory_usage.dat<command to run>.We also use the "peak" sub-command to extract the memory usage peaks from the .datfiles.All the files, including the .datfiles with the estimated memory peaks are stored under https://gitlab.com/salvo981/sonic-manuscript/genome_biology_review_files/round1/logs/memory-profiling-using-qfo/ ERROR found line with forbidden symbols in '/ssd_home/sp2-genome-biology-review-runs/input/reference_2000_mags/2156126005_1.faa' that is '*' (/^[^a-z#>]/i) full line: *

Data sources for the proteomes
We have provided the tables listing the UniProt proteome IDs for the 200 eukaryotes dataset, which was used to evaluate the tools, and the 250 eukaryotes-prokaryotes dataset, which was used to train the AdaBoost model for selecting the faster inter-proteome alignment to perform, as Additional File 2: Table S13 and Additional File 3: S14, respectively.S3).As in Figure S9, the execution time difference is proportional to the difference in proteome size (top row) and inversely proportional to AAI (bottom row).For each possible combination of N input proteomes, the fastest inter-proteome alignments are predicted using an AdaBoost binary classifier (left branch).After the fastest alignments are performed, they are processed and the essential protein sets are generated, as in Equation 2. The remaining inter-proteome alignments are performed on the reduced protein sets.The intra-proteome (right branch) and the predicted fastest inter-proteome alignments are performed using the complete input proteomes.1a), whereas those shown by filled markers include orthologs predicted from the domain-based algorithm (Figure 1b).Settings described in Supplementary Table S3 were used.a, Accuracy results obtained from generalized species tree discordance tests.b, Prediction accuracies obtained from functional benchmark tests.1a), whereas those shown by filled markers include orthologs predicted from the domain-based algorithm (Figure 1b).Settings described in Supplementary Table S3 were used.Results obtained from a, gene tree-based tests and b, species tree discordance tests.

Fig S1 :
Fig S1: QfO 2020 test datasets.Benchmark dataset from the QfO consortium.The dataset comprises 78 proteomes with an average size of 5.63 million AA.

Fig S2 :
Fig S2: Scalability of domain-based pipeline.Plots on the right and left refer to trials with QfO and MAG datasets as input, respectively.Multiple trials were performed using the domain-based pipeline with increasing numbers of input proteomes from both datasets (xaxis).Domain-based pipeline shows high scalability with a quasi-linear execution time growth as the number of input proteomes increases.a, Increase in execution time for profile searches and Doc2Vec model training.Total execution time for both steps was approximately 6 min for the complete QfO dataset and approximately 18 min (only 3X longer) for the 2,000 MAGs.b, Time required to train the Doc2Vec models increases linearly with the corpus size.c, Size of vocabulary (single words in the corpus) reaches a plateau when the input proteomes have similar contents (e.g. for the MAG dataset).

Fig
Fig S3: 2,000 MAGs dataset.Bacterial MAGs from terrestrial and aquatic environments randomly selected from a catalogue of 52,515 MAGs described by Nayfach et al. 2021.

Fig S4: 200
Fig S4: 200 Eukaryote dataset.The files in the dataset are all reference proteomes obtained from Uniprot.

Fig S5 :
Fig S5: Accuracy of SonicParanoid2 with and without domain-based orthology.SP2 runs denoted with "g" were performed using the graph-based algorithm.Methods shown by markers on the Pareto frontier demonstrate the best balance between precision and recall.Square markers represent accuracies obtained when using only the graph-based pipeline.Results of a, generalized species tree discordance tests and b, functional benchmark tests.

Fig S6 :
Fig S6: Accuracy of SonicParanoid2 with and without domain-based orthology.SP2 runs denoted with "g" were performed using the graph-based algorithm.Methods shown by markers on Pareto frontier demonstrate the best balance between precision and recall.Square markers represent accuracies obtained when using only the graph-based pipeline.Results of a, reference-tree based tests, b, species tree discordance tests (eukaryota, fungi, bacteria), and c, Vertebrate Gene Nomenclature Committee (VGNC) test.

Fig S7 :
Fig S7: Accuracy of SonicParanoid2 and other 14 methods based on the aggregate ranking from the three classification methods of the QfO benchmark.For each test in the benchmark suite, a method was assigned to group 1 if it had very high precision and recall (e.g., close to "optimal performance" corner), and group 4 if it was close to the "suboptimal performance" corner.Based on a total of 36 rankings (12 from each classification method) participants were sorted on the percentages of tests they were assigned to each group.Blue labels represent SonicParanoid2 executions which included the domain-based orthology inference, whereas purple labels refer to runs in which only graph-based orthology was performed.

Fig S8 :
Fig S8: Accuracy of SonicParanoid2 and other 14 methods based on diagonal quartiles classification from the QfO benchmark.For each of the 12 tests, participants were assigned a group based on their distance from the "optimal perfomance" corner.Methods were sorted on the percentages of tests they were assigned to each group.Blue labels represent SonicParanoid2 executions which included the domain-based orthology inference, whereas purple labels refer to runs in which only graph-based orthology was performed.

Fig S9 :
Fig S9: Accuracy of SonicParanoid2 and other 14 methods based on K-means clustering classification from the QfO benchmark.For each of the 12 tests, participants were separated into 4 clusters (groups) using the K-means algorithm.Methods were sorted on the percentages of tests they were assigned to each group.Blue labels represent SonicParanoid2 executions which included the domain-based orthology inference, whereas purple labels refer to runs in which only graph-based orthology was performed.

Fig S10 :
Fig S10: Accuracy of SonicParanoid2 and other 14 methods based on the square quartiles classification from the QfO benchmark.For each of the 12 tests, participants were assigned a group by checking if they were above or below of the first quartiles of the precision and recall.Methods were sorted on the percentages of tests they were assigned to each group.Blue labels represent SonicParanoid2 executions which included the domainbased orthology inference, whereas purple labels refer to runs in which only graph-based orthology was performed.

Fig S11 :
Fig S11: Execution time differences between complete inter-proteome alignments using MMseqs2 at multiple settings.The execution time difference for a specified pair (e.g., A-B) is computed as the absolute value of the execution times of the two interproteome alignments.Each dot represents one of the 3,003 possible combinations for the QfO dataset.

Fig S12 :
Fig S12: Execution time differences between complete inter-proteome alignments using multiple local alignment tools.Y-axis shows execution time difference between inter-proteome alignments for each of the 3,003 possible combinations in the QfO proteome set, using (left to right) MMseqs2, Diamond, and BLAST at the default mode (as in Supplementary TableS3).As in FigureS9, the execution time difference is proportional to

Fig S14 :
Fig S14: Percentage of input sequences used and evolutionary relatedness in graphbased orthology inference (QfO 2020).Y-axis shows the average percentage of original input constituting the essential subsets of each pair of proteomes.For example, if the essential sets for A and B contain 15% and 25% of the original input proteins, respectively, then 20 will be on the y-axis.The percentage of the original input proteins used is directly proportional to the evolutionary relatedness (expressed in terms of the AAI on the x-axis) and agnostic on the alignment tool used.

Fig S15 :
Fig S15: Percentage of input sequences used and evolutionary relatedness in graphbased orthology inference (2,000 bacterial MAGs).Y-axis shows average percentage of input proteins constituting the essential subsets of each inter-proteome alignment.Similarly to the observations for the QfO dataset (Supplementary Figure S12), the average percentage of input proteins used in the essential subsets is directly proportional to the evolutionary relatedness (expressed as AAI).The leftmost plots refer to pairs of MAGs, in which one MAG was obtained from aquatic samples and the other from terrestrial samples.

Fig S16 :
Fig S16: Benchmark results for predictions obtained with and without essential alignments.a, Accuracies of SonicParanoid2 obtained using essential alignments (empty markers) on the generalized species tree discordance tests are highly similar to those obtained using the complete alignments, regardless of the alignment tool and sensitivity settings used.b, The prediction accuracies on functional benchmark tests conducted using the essential and complete alignments are extremely similar and do overlap in many cases.

Fig S17 :
Fig S17: Benchmark results for predictions obtained with and without essential alignments.Empty markers represent predictions obtained using essential alignments.a, Prediction accuracies on gene-tree-based benchmark tests tend to overlap, which is likely due to the small sizes of the benchmark sets.Accuracies overlap on species tree discordance tests in b for fungi and bacteria, which use relatively small test sets.Accuracies based on eukaryote dataset appear more similar at higher sensitivity settings.

Fig S18 :
Fig S18: Orthologous domains typically missed by BBH.Proteins involved in fission or fusion events during evolution may contain orthologous relationships at the domain level (black arrows) that may be missed by BBH.a, Gene fission event in which a multi-domain protein from species X has its orthologous domains separated into two single-domain proteins in species Y. b, Gene-fusion event in which two domains in two different proteins in species X have their orthologs arranged into a single protein in species Y.

Fig S19 :
Fig S19: QfO benchmark results for SonicParanoid2 with and without domain-based orthology inference.Results shown by empty markers were obtained using the graphbased pipeline only (Figure 1a), whereas those shown by filled markers include orthologs

Fig S20 :
Fig S20: QfO benchmark results for SonicParanoid2 with and without domain-based orthology inference.Results shown by empty markers were obtained using the graphbased pipeline only (Figure 1a), whereas those shown by filled markers include orthologs

Fig S21 :
Fig S21: Dataset used to generate training samples and labels for binary AdaBoost classifier.Average size of the proteomes is 5.50 million AA, with Glicine max (21.73 million AA) and Nanoarchaeum equitans (0.15 million AA) being the largest and smallest, respectively.

Fig S22 :
Fig S22: Example of architecture estimation and document creation for a single input protein.a, Search the protein in the Pfam profile database using MMseqs2.b, Only hits with a bitscore exceeding 30 and domain coverage (t-coverage) exceeding 75% are used to build the architectures.c, Representation of the architecture as a sequence of domains, document, and architecture features.Black blocks in the architecture represent unannotated regions.

Fig S23 :
Fig S23: Selection of candidate orthologs in the domain-based pipeline.Maximum cosine similarities for each pair of architecture from proteomes A (5 proteins) and B (6 proteins) are selected.The pairs of architectures with maximum cosine similarity are selected row-wise (top-left) and column-wise (top-right).Pairs corresponding to entries in the final matrix (bottom) will populate the ortholog clusters generated by the domain-based pipeline.